System and method for reconstructing client web page accesses from captured network packets

ABSTRACT

According to one embodiment of the present invention, a method for reconstructing client web page accesses is provided that comprises capturing network-level information for client accesses of at least one web page, and using the captured network-level information to reconstruct client accesses of the at least one web page. Another embodiment of the present invention provides a method for reconstructing client information accesses. The method comprises capturing network-level information for client accesses of information from a server, wherein each client access of the information comprises a plurality of transactions. The method further comprises relating the plurality of transactions to their corresponding client access of information from the server.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 10/147,619, filed May 16, 2002, now U.S. Pat. No. 7,246,101, entitled “KNOWLEDGE-BASED SYSTEM AND METHOD FOR RECONSTRUCTING CLIENT WEB PAGE ACCESSES FROM CAPTURED NETWORK PACKETS”; U.S. patent application Ser. No. 10/147,249 (U.S. Patent Application Publication No. 2003/0217130), filed May 16, 2002, now U.S. Pat. No. 7,437,451, entitled “SYSTEM AND METHOD FOR COLLECTING DESIRED INFORMATION FOR NETWORK TRANSACTIONS AT THE KERNEL LEVEL”; U.S. patent application Ser. No. 10/146,988 (U.S. Patent Application Publication No. 2005/0076111), filed May 16, 2002, entitled “SYSTEM AND METHOD FOR RELATING ABORTED CLIENT ACCESSES OF DATA TO QUALITY OF SERVICE PROVIDED BY A SERVER IN A CLIENT-SERVER NETWORK”; and U.S. patent application Ser. No. 10/146,967 (U.S. Patent Application Publication No. 2003/0221000), filed May 16, 2002, entitled “SYSTEM AND METHOD FOR MEASURING WEB SERVICE PERFORMANCE USING CAPTURED NETWORK PACKETS”, the disclosures of which are hereby incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates in general to client-server networks, and more specifically to a system and method for reconstructing a client access of a web service from network packets captured on the web server side of a client-server network.

BACKGROUND OF THE INVENTION

Today, Internet services are delivering a large array of business, government, and personal services. Similarly, mission critical operations, related to scientific instrumentation, military operations, and health services, are making increasing use of the Internet for delivering information and distributed coordination. For example, many users are accessing the Internet seeking such services as personal shopping, airline reservations, rental car reservations, hotel reservations, on-line auctions, on-line banking, stock market trading, as well as many other services being offered via the Internet. Many companies are providing such services via the Internet, and are therefore beginning to compete in this forum. Accordingly, it is important for such service providers (sometimes referred to as “content providers”) to provide high-quality services.

One measure of the quality of service provided by service providers is the end-to-end performance characteristic. The end-to-end performance perceived by clients is a major concern of service providers. In general, the end-to-end performance perceived by a client is a measurement of the time from when the client requests a service (e.g., a web page) from a service provider to the time when the client fully receives the requested service. For instance, if a client requests to access a service provided by a service provider, and it takes several minutes for the service to be downloaded from the service provider to the client, the client may consider the quality of the service as being poor because of its long download time. In fact, the client may be too impatient to wait for the service to fully load and may instead attempt to obtain the service from another provider. Currently, most website providers set a target client-perceived end-to-end time of less than six seconds for their web pages. That is, website providers typically like to provide their requested web pages to a client in less than six seconds from the time the client requests the page.

A popular client-server network is the Internet. The Internet is a packet-switched network, which means that when information is sent across the Internet from one computer to another, the data is broken into small packets. A series of switches called routers send each packet across the network individually. After all of the packets arrive at the receiving computer, they are recombined into their original, unified form. TCP/IP is a protocol commonly used for communicating the packets of data. In TCP/IP, two protocols do the work of breaking the data into packets, routing the packets across the Internet, and then recombining them on the other end: 1) the Internet Protocol (IP), which routes the data, and 2) the Transmission Control Protocol (TCP), which breaks the data into packets and recombines them on the computer that receives the information. TCP/IP is well known in the existing art, and therefore is not described in further detail herein.
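As a purely illustrative aid, the breaking apart and recombining of data described above may be sketched as follows. This is a greatly simplified sketch and not an implementation of TCP: the names `Packet`, `packetize`, and `reassemble` and the 8-byte payload size are assumptions made for the example, and retransmission, acknowledgment, and flow control are omitted entirely.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    seq: int        # byte offset of this payload within the original data
    payload: bytes  # the slice of data carried by this packet


def packetize(data: bytes, size: int = 8) -> list[Packet]:
    """Break data into packets of at most `size` payload bytes each."""
    return [Packet(off, data[off:off + size]) for off in range(0, len(data), size)]


def reassemble(packets: list[Packet]) -> bytes:
    """Recombine packets into the original data, whatever order they arrived in."""
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.seq))


if __name__ == "__main__":
    message = b"GET /index.html HTTP/1.1\r\n"
    packets = packetize(message)
    random.shuffle(packets)  # packets may arrive out of order
    assert reassemble(packets) == message
```

The sequence offset carried by each packet is what allows the receiver to restore the original order.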

One popular part of the Internet is the World Wide Web (which may be referred to herein simply as the “web”). Computers (or “servers”) that provide information on the web are typically called “websites.” Services offered by service providers' websites are obtained by clients via the web by downloading web pages from such websites to a browser executing on the client. For example, a user may use a computer (e.g., personal computer, laptop computer, workstation, personal digital assistant, cellular telephone, or other processor-based device capable of accessing the Internet) to access the Internet (e.g., via a conventional modem, cable modem, Digital Subscriber Line (DSL) connection, or the like). A browser, such as NETSCAPE NAVIGATOR developed by NETSCAPE, INC. or MICROSOFT INTERNET EXPLORER developed by MICROSOFT CORPORATION, as examples, may be executing on the user's computer to enable a user to input information requesting to access a particular website and to output information (e.g., web pages) received from an accessed website.

In general, a web page is typically composed of a mark-up language file, such as a HyperText Mark-up Language (HTML), Extensible Mark-up Language (XML), Handheld Device Mark-up Language (HDML), or Wireless Mark-up Language (WML) file, and several embedded objects, such as images. A browser retrieves a web page by issuing a series of HyperText Transfer Protocol (HTTP) requests for all objects. As is well known, HTTP is the underlying protocol used by the World Wide Web. The HTTP requests can be sent through one persistent TCP connection or multiple concurrent connections.

As described above, service providers often desire to have an understanding of their end-to-end performance characteristics. Effectively monitoring and characterizing the end-to-end behavior of web transactions is important for evaluating and/or improving the web site performance and selecting the proper web site architecture for a service provider to implement. Because in this forum the client-perceived website responses are downloaded web pages, the performance related to web page downloading is one of the critical elements in evaluating end-to-end performance. However, the nature of the Internet and the manner in which services are provided via the web result in difficulty in acquiring meaningful performance measurements. For instance, the best-effort nature of Internet data delivery, changing client and network connectivity characteristics, and the highly complex architectures of modern Internet services make it very difficult to understand the performance characteristics of Internet services. In a competitive landscape, such understanding is critical to continually evolving and engineering Internet services to match changing demand levels and client populations.

Two popular techniques exist in the prior art for benchmarking the performance of Internet services: 1) the active probing technique, and 2) the web page instrumentation technique. The active probing technique uses machines from fixed points in the Internet to periodically request one or more Uniform Resource Locators (URLs) from a target web service, record end-to-end performance characteristics, and report a time-varying summary back to the web service. For example, in an active probing technique, artificial clients may be implemented at various fixed points (e.g., at fifty different points) within a network, and such artificial clients may periodically (e.g., once every hour or once every 15 minutes) request a particular web page from a website and measure the end-to-end performance for receiving the requested web page at the requesting artificial client. A number of companies use active probing techniques to offer measurement and testing services, including KEYNOTE SYSTEMS, INC., NETMECHANIC, INC., SOFTWARE RESEARCH INC., and PORIVO TECHNOLOGIES, INC.
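For illustration only, the measurement made by a single artificial client in the active probing technique may be sketched as follows. The function names `probe_once` and `probe_summary`, and the choice to pass the page-fetching routine and the clock in as parameters, are assumptions made for this example; they do not describe any of the commercial services named above.

```python
import time
from typing import Callable


def probe_once(fetch: Callable[[str], bytes], url: str,
               clock: Callable[[], float] = time.perf_counter) -> float:
    """Request `url` once and return the elapsed end-to-end time in seconds.

    `fetch` is a caller-supplied routine that downloads the URL; it is a
    parameter here so that the sketch does not depend on a live network.
    """
    start = clock()
    fetch(url)
    return clock() - start


def probe_summary(samples: list[float]) -> dict[str, float]:
    """Summarize a series of probe measurements for reporting."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": sum(samples) / len(samples),
    }
```

A probing service would call such a routine periodically from each fixed point and report the summary; the sketch measures only what the artificial client itself observes.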

The active probing techniques are based on periodic polling of web services using a set of geographically distributed, synthetic clients. In general, only a few pages or operations can typically be tested, potentially reflecting only a fraction of all users' experience with the services of a given web service provider. Further, active probing techniques typically cannot capture the potential benefits of browser and network caches, in some sense reflecting “worst case” performance. From another perspective, active probes comprise a different set of machines than those that actually access the service. For example, the artificial clients used for probing a website may comprise different hardware and/or different network connectivity than that of typical end users of the website. For instance, most users of a particular website may have a dial-up modem connection (e.g., using a 56 kilobit-per-second modem) to the Internet, while the artificial clients used for probing may have direct connections, cable modem connections, Integrated Services Digital Network (ISDN) connections, or Digital Subscriber Line (DSL) connections. Thus, there may not always be correlation in the performance/reliability reported by the probing service and that experienced by actual end users.

Finally, it is difficult to determine the breakdown between network and server-side performance using active probing, making it difficult for service providers to determine where best to place their optimization efforts. That is, active probing techniques indicate the end-to-end performance measurement for a web page, but they do not indicate the amount of latency that is attributable to the web server and the amount of latency that is attributable to the network. For instance, a service provider may be unable to alter the latency caused by congestion on the network, but the service provider may be able to evaluate and improve its server's performance if much of the latency is due to the server (e.g., by decreasing the number of processes running on the server, re-designing the web page, altering the web server's architecture, etc.).

The second technique for measuring performance, the web page instrumentation technique, associates code (e.g., JAVASCRIPT) with target web pages. The code, after being downloaded into the client browser, tracks the download time for individual objects and reports performance characteristics back to the web site. That is, in this technique, instrumentation code embedded in web pages and downloaded to the client is used to record access times and report statistics back to the server. For example, a web page may be coded to include instructions that are executable to measure the download time for objects of the web page. Accordingly, when a user requests the web page, the coded instrumentation portion of the web page may first be downloaded to the client, and such instrumentation may execute to measure the time for the client receiving each of the other objects of the web page.

As an example, WEB TRANSACTION OBSERVER (WTO) from HEWLETT PACKARD'S OPENVIEW suite uses JAVASCRIPT to implement such a web page instrumentation technique. With additional web server instrumentation and cookie techniques, this product can record the server processing time for a request, enabling a breakdown between server and network processing time. A number of other products and proposals employ similar techniques, such as the TIVOLI WEB MANAGEMENT SOLUTIONS available from IBM CORPORATION, CANDLE CORPORATION'S EBUSINESS ASSURANCE, and “Measuring Client-Perceived Response Times on the WWW” by R. Rajamony and M. Elnozahy at Proceedings of the Third USENIX Symposium on Internet Technologies and Systems (USITS), March 2001, San Francisco.

Because the web page instrumentation technique downloads instrumentation code to actual clients, this technique can capture end-to-end performance information from real clients, as opposed to capturing end-to-end performance information for synthetic (or “artificial”) clients, as with the above-described active probing techniques. However, the web page instrumentation technique fails to capture connection establishment times (because the instrumentation code is not downloaded to a client until after the connection has been established), which are potentially an important aspect of overall performance. Further, there is a certain amount of resistance in the industry to the web page instrumentation technique. The web page instrumentation technique requires additional server-side instrumentation and dedicated resources to actively collect performance reports from clients. For example, added instrumentation code is required to be included in a web page to be monitored, thus increasing the complexity associated with coding such web page and introducing further potential for coding errors that may be present in the web page (as well as further code maintenance that may be required for the web page).

BRIEF SUMMARY OF THE INVENTION

According to one embodiment of the present invention, a method for reconstructing client web page accesses is provided that comprises capturing network-level information for client accesses of at least one web page, and using the captured network-level information to reconstruct client accesses of the at least one web page. Another embodiment of the present invention provides a method for reconstructing client information accesses. The method comprises capturing network-level information for client accesses of information from a server, wherein each client access of the information comprises a plurality of transactions. The method further comprises relating the plurality of transactions to their corresponding client access of information from the server.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 shows an example client-server system in which embodiments of the present invention may be implemented;

FIG. 2 shows the well-known Open System Interconnection (OSI) model for a network framework;

FIG. 3 shows a block diagram of a preferred embodiment of the present invention for reconstructing client accesses of web pages;

FIG. 4 shows an example operational flow diagram of a preferred embodiment of the present invention;

FIG. 5 shows a block diagram of reconstructing web accesses from captured HTTP transactions;

FIG. 6 shows an example operational flow diagram for relating the transactions of a Transaction Log to web page accesses in accordance with a preferred embodiment of the present invention;

FIG. 7 shows an example hash table (or Web Page Session Log) for grouping transactions into their corresponding web page accesses in accordance with a preferred embodiment of the present invention;

FIG. 8 shows an example operational flow diagram for grouping web page objects (or transactions) into their corresponding client web page accesses to form a Web Page Session Log in accordance with a preferred embodiment of the present invention;

FIG. 9 shows an example operational flow diagram for determining the content of a web page in accordance with a preferred embodiment of the present invention;

FIG. 10 shows an example of a solution for reconstructing web page accesses that is deployed as a network appliance;

FIG. 11 shows an example of a solution for reconstructing web page accesses that is deployed as software on a web server;

FIG. 12 shows an example of a solution for reconstructing web page accesses in which a portion of the solution is deployed as software on a web server and a portion of the solution is deployed as software on an independent node; and

FIG. 13 shows an example computer system on which embodiments of the present invention may be implemented.

DETAILED DESCRIPTION OF THE INVENTION

As described above, service providers in a client-server network (e.g., website providers) often desire to have an understanding of their client-perceived end-to-end performance characteristics. In the web forum, the client-perceived end-to-end performance is the client-perceived time for downloading a requested web page from a website. Accordingly, the performance related to web page downloading is one of the critical elements in evaluating end-to-end performance of website providers. As described above, a web page is generally composed of a mark-up language file, such as an HTML, XML, HDML, or WML file, and several embedded objects, such as images. A browser executing at a client computer retrieves a web page by issuing a series of HTTP requests for all objects of the desired web page. The HTTP requests can be sent through one persistent TCP connection or multiple concurrent connections. However, HTTP does not provide any means to delimit the beginning or the end of a web page, which makes it difficult for the server to determine the client-perceived end-to-end performance. Accordingly, a desire exists for the capability to reconstruct a web page access in order, for example, to measure the client-perceived end-to-end performance in serving up such web page.

Turning to FIG. 1, an example client-server system 100 is shown in which embodiments of the present invention may be implemented. As shown, one or more servers 101A-101D may provide services to one or more clients, such as clients A-C (labeled 104A-104C, respectively), via communication network 103. Communication network 103 is preferably a packet-switched network, and in various implementations may comprise, as examples, the Internet or other Wide Area Network (WAN), an Intranet, Local Area Network (LAN), wireless network, Public (or private) Switched Telephony Network (PSTN), a combination of the above, or any other communications network now known or later developed within the networking arts that permits two or more computers to communicate with each other.

In a preferred embodiment, servers 101A-101D comprise web servers that are utilized to serve up web pages to clients A-C via communication network 103 in a manner as is well known in the art. Accordingly, system 100 of FIG. 1 illustrates an example of servers 101A-101D serving up web pages, such as web page 102, to requesting clients A-C. Of course, embodiments of the present invention are not limited in application to reconstructing web page accesses within a web environment, but may instead be implemented for reconstructing many other types of client accesses of a server. For example, clients may access server(s) in various other types of client-server environments in order to receive information from such server(s). Further, the information may be communicated from the server(s) to the clients through a plurality of transactions (e.g., a plurality of requests/responses) via a packet-switched network. Embodiments of the present invention may be implemented to utilize network packets captured during the transactions between the client and server to reconstruct the client accesses of the server. For instance, embodiments of the present invention may be implemented to group the corresponding transactions for communicating particular information (e.g., a web page) to a client in order to reconstruct the client's access of such particular information from the server. Accordingly, the reconstructed access may, in certain implementations, be utilized to measure the client-perceived end-to-end performance in receiving the particular information from the server.

In the example of FIG. 1, web page 102 comprises an HTML (or other mark-up language) file 102A (which may be referred to herein as a “main page”), and several embedded objects (e.g., images, etc.), such as Object₁ and Object₂. Techniques for serving up such web page 102 to requesting clients A-C are well known in the art, and therefore such techniques are only briefly described herein. In general, a browser, such as browsers 105A-105C, may be executing at a client computer, such as clients A-C. To retrieve a desired web page 102, the browser issues a series of HTTP requests for all objects of the desired web page. For instance, various client requests and server responses are communicated between client A and server 101A in serving web page 102 to client A, such as requests/responses 106A-106F (referred to collectively herein as requests/responses 106). Requests/responses 106 provide a simplified example of the type of interaction typically involved in serving a desired web page 102 from server 101A to client A. As those of skill in the art will appreciate, requests/responses 106 do not illustrate all interaction that is involved through TCP/IP communication for serving a web page to a client, but rather provide an illustrative example of the general interaction between client A and server 101A in providing web page 102 to client A.

When a client clicks a hypertext link (or otherwise requests a URL) to retrieve a particular web page, the browser first establishes a TCP connection with the web server by sending a SYN packet (not shown in FIG. 1). If the server is ready to process the request, it accepts the connection by sending back a second SYN packet (commonly called a SYN-ACK; not shown in FIG. 1) acknowledging the client's SYN. At this point, the client is ready to send HTTP requests 106 to retrieve the HTML file 102A and all embedded objects (e.g., Object₁ and Object₂), as described below.
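By way of a non-limiting sketch, the connection establishment exchange described above can be recognized in captured packets from the TCP flag bits. The record type `TcpPacket` and the function `connection_setup_time` are hypothetical names introduced for this example, and the packets are assumed to belong to a single connection and to be listed in capture order. The flag values 0x02 (SYN) and 0x10 (ACK) are the standard TCP header bits.

```python
from typing import NamedTuple, Optional

SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit


class TcpPacket(NamedTuple):
    timestamp: float   # capture time, in seconds
    from_client: bool  # True if the packet travelled from client to server
    flags: int         # TCP flag bits of the packet


def connection_setup_time(packets) -> Optional[float]:
    """Return the time from the client's SYN to the server's acknowledging SYN.

    Returns None when the capture does not contain both packets.
    """
    syn_time = None
    for p in packets:
        is_syn = bool(p.flags & SYN)
        is_ack = bool(p.flags & ACK)
        if p.from_client and is_syn and not is_ack and syn_time is None:
            syn_time = p.timestamp          # client opens the connection
        elif not p.from_client and is_syn and is_ack and syn_time is not None:
            return p.timestamp - syn_time   # server accepts the connection
    return None
```

When the packets are captured at the server side, this interval reflects how long the server took to accept the connection; it does not include network transit time to or from the client.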

First, client A makes an HTTP request 106A to server 101A for web page 102 (e.g., via client A's browser 105A). Such request may, as examples, be in response to a user inputting the URL for web page 102 or in response to a user clicking on a hyperlink to web page 102. Server 101A receives the HTTP request 106A and sends HTML file 102A (e.g., file “index.html”) of web page 102 to client A via response 106B. HTML file 102A typically identifies the various objects embedded in web page 102, such as Object₁ and Object₂. Accordingly, upon receiving HTML file 102A, browser 105A requests the identified objects, Object₁ and Object₂, via requests 106C and 106E. Upon server 101A receiving the requests for such objects, it communicates each object individually to client A via responses 106D and 106F, respectively.
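To illustrate how a main page identifies its embedded objects, the following minimal sketch lists the objects that a browser would go on to request after receiving an HTML file. It uses the `html.parser` module of the Python standard library; the class name `EmbeddedObjectFinder`, the function `embedded_objects`, and the particular set of tags examined are assumptions chosen for the example, not a complete model of browser behavior.

```python
from html.parser import HTMLParser


class EmbeddedObjectFinder(HTMLParser):
    """Collect the URLs of objects embedded in an HTML main page."""

    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.objects.append(attrs["src"])
        elif (tag == "link" and attrs.get("href")
              and (attrs.get("rel") or "").lower() == "stylesheet"):
            self.objects.append(attrs["href"])


def embedded_objects(html: str) -> list[str]:
    """Return embedded object URLs in the order they appear in the page."""
    finder = EmbeddedObjectFinder()
    finder.feed(html)
    finder.close()
    return finder.objects


if __name__ == "__main__":
    page = '<html><body><img src="object1.gif"><img src="object2.gif"></body></html>'
    print(embedded_objects(page))
```

Each URL found corresponds to one further request/response pair, in the manner of requests 106C and 106E and responses 106D and 106F.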

Again, the above interactions are simplified to illustrate the general nature of requesting a web page, from which it should be recognized that each object of a web page is requested individually by the requesting client and is, in turn, communicated individually from the server to the requesting client. The above requests/responses 106 may each comprise multiple packets of data. Further, the HTTP requests can, in certain implementations, be sent from a client through one persistent TCP connection with server 101A, or, in other implementations, the requests may be sent through multiple concurrent connections. Server 101A may also be accessed by other clients, such as clients B and C of FIG. 1, and various web page objects may be communicated in a similar manner to those clients through packet communication 107 and 108, respectively.

In general, the client-perceived end-to-end performance for receiving web page 102 is measured from the time that the client requests web page 102 to the time that the client receives all objects of the web page (i.e., receives the full page). However, HTTP does not provide any means to delimit the beginning or the end of a web page. For instance, HTTP is a stateless protocol in which each HTTP request is executed independently without any knowledge of the requests that came before it. Accordingly, it is difficult at the server side (e.g., at server 101A) to reconstruct a web page access for a given client without parsing the original HTML file.
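Once the transactions belonging to one web page access are known, the measurement described above reduces to simple arithmetic, as the following sketch suggests. The record type `Transaction` and the function `page_access_time` are hypothetical names used only for illustration, and the sketch assumes that the grouping of transactions into a single page access has already been performed by some other means.

```python
from typing import NamedTuple


class Transaction(NamedTuple):
    url: str
    request_start: float  # time the request was observed, in seconds
    response_end: float   # time the last byte of the response was observed


def page_access_time(transactions) -> float:
    """Time from the first request of the page to the last response received.

    Using min/max rather than the first and last list entries keeps the
    result correct when objects are fetched over concurrent connections
    and therefore overlap in time.
    """
    start = min(t.request_start for t in transactions)
    end = max(t.response_end for t in transactions)
    return end - start


if __name__ == "__main__":
    access = [
        Transaction("/index.html", 0.0, 0.5),
        Transaction("/object1.gif", 0.5, 1.25),
        Transaction("/object2.gif", 0.5, 1.0),
    ]
    print(page_access_time(access))  # 1.25
```

The difficulty noted above lies not in this arithmetic but in deciding which transactions belong to the same page access.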

Embodiments of the present invention enable a passive technique for reconstructing web page accesses from captured network-level information. That is, network packets acquired by a network-level capture tool, such as the well-known UNIX tcpdump tool, may be used to determine (or reconstruct) a client's web page access. From such reconstruction of the client's web page access, the client-perceived end-to-end response time for a web page download may be determined. Thus, various embodiments of the present invention enable a passive, end-to-end monitor that is operable to reconstruct client web page accesses using captured network-level information.

The “network level” may be better understood with reference to the well-known Open System Interconnection (OSI) model, which defines a networking framework for implementing protocols in seven layers. The OSI model is a teaching model that identifies functionality that is typically present in a communication system, although in some implementations two or three OSI layers may be incorporated into one. The seven layers of the OSI model are briefly described hereafter in conjunction with FIG. 2. According to the OSI model, data 203 is communicated from computer (e.g., server) 201 to computer (e.g., client) 202 through the various layers. That is, control is passed from one layer to the next, starting at the application layer 204 in computer 201, proceeding to the bottom layer, over the channel to computer 202, and back up the hierarchy.

In general, application layer 204 supports application and end-user processes. Communication partners are identified, quality of service is identified, user authentication and privacy are considered, and any constraints on data syntax are identified. This layer provides application services for file transfers, e-mail, and other network software services. For example, a client browser executes in application layer 204.

According to the OSI model, presentation layer 205 provides independence from differences in data representation (e.g., encryption) by translating from application to network format, and vice versa. Presentation layer 205 works to transform data into the form that the application layer 204 can accept. This layer 205 formats and encrypts data to be sent across a network, providing freedom from compatibility problems. It is sometimes called the “syntax layer.”

Session layer 206 of the OSI model establishes, manages and terminates connections between applications. Session layer 206 sets up, coordinates, and terminates conversations, exchanges, and dialogues between the applications at each end of the communication. It deals with session and connection coordination. Transport layer 207 of the OSI model provides transparent transfer of data between end systems, or hosts, and is responsible for end-to-end error recovery and flow control. It ensures complete data transfer.

Network layer 208 of the OSI model provides switching and routing technologies, creating logical paths, such as virtual circuits, for transmitting data from node to node. Thus, routing and forwarding are functions of this layer 208, as well as addressing, internetworking, error handling, congestion control, and packet sequencing.

At data link layer 209 of the OSI model, data packets are encoded and decoded into bits. This layer furnishes transmission protocol knowledge and management and handles errors in the physical layer 210, flow control and frame synchronization. Data link layer 209 may be divided into two sublayers: 1) the Media Access Control (MAC) sublayer, and 2) the Logical Link Control (LLC) sublayer. The MAC sublayer controls how a computer on the network gains access to the data and permission to transmit it. The LLC sublayer controls frame synchronization, flow control and error checking.

Physical layer 210 conveys the bit stream (e.g., electrical impulse, light, or radio signal) through the communication network at the electrical and mechanical level. It provides the hardware means of sending and receiving data on a carrier, including defining cables, cards and physical aspects. Fast Ethernet, RS232, and ATM are example protocols with components of physical layer 210.

As described above, one technique for measuring client-perceived end-to-end performance is the web page instrumentation technique in which instrumentation code is included in a web page and is downloaded from the server to a client. More specifically, in this technique, the web page instrumentation code for a web page is downloaded from a server to the client and is executed by the client's browser (in the application layer 204) to measure the end-to-end time for downloading the web page to the client. Accordingly, such web page instrumentation technique captures information at the application layer 204 for measuring client-perceived end-to-end performance. As described further below, embodiments of the present invention utilize information captured at the network layer 208 to reconstruct client web page accesses, thereby eliminating the requirement of including instrumentation in a web page for measuring end-to-end performance. Thus, embodiments of the present invention enable a server (or other node(s) properly positioned on the communication network) to reconstruct information regarding client web page accesses from captured network layer information (e.g., captured network packets).

Another technique utilized in the prior art for measuring end-to-end performance is the active probing technique. As described above, the active probing technique utilizes artificial clients to actively probe a particular web page (i.e., by actively accessing the particular web page) on a periodic basis and measure the response time for receiving the requested web page. As described further below, embodiments of the present invention provide a passive technique that is capable of utilizing captured network-level information to reconstruct actual client web page accesses. Accordingly, rather than actively probing web pages from artificial clients, embodiments of the present invention enable passive monitoring of web page accesses by actual clients to, for example, measure the client-perceived end-to-end performance for such web pages.

Accordingly, embodiments of the present invention enable actual client web page accesses to be reconstructed without requiring instrumentation code to be included in a web page for monitoring a client's access of such web page (as is required in the web page instrumentation technique). Also, embodiments of the present invention enable actual client web page accesses to be reconstructed, as opposed to monitoring artificial clients as in the active probing technique. Further, embodiments of the present invention provide a passive monitoring technique that enables actual network-level information (e.g., packets) to be captured and used for reconstructing client web page accesses, as opposed to actively probing web pages as in the active probing technique. Thus, a web page provider may utilize an embodiment of the present invention to passively reconstruct web page accesses (e.g., to measure the client-perceived end-to-end performance for such accesses) through captured network-level information from the actual client accesses, rather than actively accessing the web page from “test” client(s) in order to measure the end-to-end performance perceived by the “test” client(s).

A block diagram of a preferred embodiment for reconstructing client accesses of web pages is shown in FIG. 3. As shown, a preferred embodiment comprises network packets collector module 301, request-response reconstructor module 302 (which may be referred to herein as transaction reconstructor module 302), and web page access reconstructor module 303. As described further hereafter, performance analysis module 304 may be included in certain implementations for measuring client-perceived end-to-end performance for the reconstructed web page accesses, thereby forming a passive end-to-end monitor.

Network packets collector module 301 is operable to collect network-level information that is utilized to reconstruct web page accesses. In a preferred embodiment, network packets collector module 301 utilizes a tool to capture network packets, such as the publicly available UNIX tool known as “tcpdump” or the publicly available WINDOWS tool known as “WinDump.” The software tools “tcpdump” and “WinDump” are well-known and are commonly used in the networking arts for capturing network-level information for network “sniffer/analyzer” applications. Typically, such tools are used to capture network-level information for monitoring security on a computer network (e.g., to detect unauthorized intruders, or “hackers”, in a system). Of course, other tools now known or later developed for capturing network-level information, or at least the network-level information utilized by embodiments of the present invention, may be utilized in alternative embodiments of the present invention.

Network packets collector module 301 records the captured network-level information (e.g., network packets) to a Network Trace file 301A. This approach allows the Network Trace 301A to be processed in offline mode. For example, tcpdump may be utilized to capture many packets (e.g., a million packets) for a given period of time (e.g., over the course of a day), which may be compiled in the Network Trace 301A. Thereafter, such collected packets in the Network Trace 301A may be utilized by request-response reconstructor module 302 in the manner described further below. While a preferred embodiment utilizes a tool, such as tcpdump, to collect network information for offline processing, known programming techniques may be used, in alternative embodiments, to implement a real-time network collection tool. If such a real-time network collection tool is implemented in network packets collector module 301, the various other modules of FIG. 3 may be similarly implemented to use the real-time captured network information to reconstruct web page accesses (e.g., in an on-line mode of operation).

From the Network Trace 301A, request-response reconstructor module 302 reconstructs all TCP connections and extracts HTTP transactions (e.g., a request with the corresponding response) from the payload of the reconstructed TCP connections. More specifically, in one embodiment, request-response reconstructor module 302 rebuilds the TCP connections from the Network Trace 301A using the client IP addresses, client port numbers and the request (response) TCP sequence numbers. Within the payload of the rebuilt TCP connections, the HTTP transactions may be delimited as defined by the HTTP protocol. Meanwhile, the timestamps, sequence numbers and acknowledged sequence numbers may be recorded for the corresponding beginning or end of HTTP transactions. After reconstructing the HTTP transactions, request-response reconstructor module 302 may extract HTTP header lines from the transactions. The HTTP header lines are preferably extracted from the transactions because the payload does not contain any additional useful information for reconstructing web page accesses, but the payload requires approximately two orders of magnitude of additional storage space. The resulting outcome of extracting the HTTP header lines from the transactions is recorded to a Transaction Log 302A, which is described further below. That is, after obtaining the HTTP transactions, request-response reconstructor module 302 stores some HTTP header lines and other related information from Network Trace 301A in Transaction Log 302A for future processing (preferably excluding the redundant HTTP payload in order to minimize storage requirements).
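As an illustration only, the rebuilding step described above may be sketched in Python as follows. The `Packet` type, its field names, and the `rebuild_streams` function are hypothetical names introduced for this sketch and do not come from the described embodiment. The sketch assumes that the packets have already been parsed from the trace, and it ignores sequence-number wraparound, overlapping segments, and reuse of a client port by a later connection.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """One captured TCP segment (hypothetical, simplified representation)."""
    client_ip: str
    client_port: int
    from_client: bool   # True for request direction, False for response direction
    seq: int            # TCP sequence number of the first payload byte
    timestamp: float
    payload: bytes


StreamKey = Tuple[str, int, bool]


def rebuild_streams(packets: Iterable[Packet]) -> Dict[StreamKey, bytes]:
    """Group segments by (client IP, client port, direction) and order them by
    TCP sequence number to recover the byte stream of each direction."""
    segments: Dict[StreamKey, Dict[int, bytes]] = defaultdict(dict)
    for p in packets:
        if not p.payload:
            continue  # pure SYN/ACK segments carry no HTTP data
        key = (p.client_ip, p.client_port, p.from_client)
        # Keep the first copy seen for a sequence number; a retransmitted
        # segment with the same sequence number is ignored.
        segments[key].setdefault(p.seq, p.payload)
    return {
        key: b"".join(by_seq[s] for s in sorted(by_seq))
        for key, by_seq in segments.items()
    }
```

The recovered request and response byte streams could then be split into HTTP transactions according to the HTTP protocol's message framing, which this sketch does not attempt.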

A methodology for rebuilding HTTP transactions from TCP-level traces was proposed by Anja Feldmann in “BLT: Bi-Layer Tracing of HTTP and TCP/IP”, Proceedings of WWW-9, May 2000, the disclosure of which is hereby incorporated herein by reference. Balachander Krishnamurthy and Jennifer Rexford explain this mechanism in more detail and extend this solution to rebuild HTTP transactions for persistent connections in “Web Protocols and Practice: HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement” pp. 511-522, Addison Wesley, 2001, the disclosure of which is also hereby incorporated herein by reference. Accordingly, in a preferred embodiment of the present invention, request-response reconstructor module 302 uses such methodology for rebuilding HTTP transactions from TCP-level traces.

In an alternative embodiment, Transaction Log 302A may be generated in a kernel-level module implemented on the server as described in greater detail in U.S. patent application Ser. No. 10/147,249, now U.S. Pat. No. 7,437,451, titled “SYSTEM AND METHOD FOR COLLECTING DESIRED INFORMATION FOR NETWORK TRANSACTIONS AT THE KERNEL LEVEL,” the disclosure of which is incorporated herein by reference. Such alternative embodiment may be desired because, for example, it enables information for transactions to be collected at the kernel level of a server (e.g., a web server), which may avoid rebuilding the transactions at the user level as in the methodology proposed by Anja Feldmann. Such alternative embodiment may enable greater computing efficiency in generating Transaction Log 302A because the transactions are not required to be reconstructed at the user level, and/or it may require less storage space because only the desired information for transactions may be communicated from the kernel level to the user level as opposed to the raw network information of Network Trace 301A being stored at the user level (which may include much more information than is desired for each transaction), as described further in the above-referenced U.S. Patent Application “SYSTEM AND METHOD FOR COLLECTING DESIRED INFORMATION FOR NETWORK TRANSACTIONS AT THE KERNEL LEVEL.”

As described above, a web page is generally composed of one HTML file and some embedded objects, such as images or JAVASCRIPTS. When a client requests a particular web page, the client's browser should retrieve all the page-embedded images from a web server in order to display the requested page. The client browser retrieves each of these embedded images separately. As illustrated by the generic example of FIG. 1, each object of a requested web page is retrieved from a server by an individual HTTP request made by the client. An HTTP request-response pair may be referred to collectively herein as an HTTP “transaction.” Entries of Transaction Log 302A contain information about these individual HTTP transactions (i.e., requests/responses).

Once information about various individual HTTP transactions is collected in Transaction Log 302A (e.g., either from Network Trace 301A or from a kernel level module), the next step in reconstructing a web page access is to relate the different individual HTTP transactions in the sessions corresponding to a particular web page access. That is, the various different HTTP transactions collected in Transaction Log 302A are related together as logical web pages. In a preferred embodiment, web page access reconstructor module 303 is responsible for grouping the underlying physical object retrievals together into logical web pages, and stores them in Web Page Session Log 303A. More specifically, web page access reconstructor module 303 analyzes Transaction Log 302A and groups the various different HTTP transactions that correspond to a common web page access. Thus, Web Page Session Log 303A comprises the HTTP transactions organized (or grouped) into logical web page accesses.

After different request-response pairs (i.e., HTTP transactions) are grouped into web page retrieval sessions in Web Page Session Log 303A, performance analysis module 304 may, in certain implementations, be utilized to measure the client-perceived end-to-end response time for a web page download. That is, once the HTTP transactions for a common web page access are grouped together in Web Page Session Log 303A, such grouping of HTTP transactions may be utilized to measure the client-perceived end-to-end performance for each reconstructed web page access.

Turning to FIG. 4, an example operational flow diagram of a preferred embodiment of the present invention is shown. As shown, in operational block 401, network packets are collected by the network packets collector module 301 during client web page accesses in order to form Network Trace 301A. More specifically, network-level information may be captured in real-time (i.e., as client web page accesses are occurring) using, for example, tcpdump or other network information capture tool. The captured information may be stored in Network Trace 301A for later, off-line evaluation thereof.

In operational block 402, request-response reconstructor module 302 rebuilds the TCP connections from the Network Trace 301A. Thereafter, in operational block 403, request-response reconstructor module 302 extracts HTTP transactions from the payload of the rebuilt TCP connections to form Transaction Log 302A. Web page access reconstructor module 303 retrieves information from Transaction Log 302A, in operational block 404, and relates the HTTP transactions stored in Transaction Log 302A to their corresponding web page accesses. That is, in operational block 404, web page access reconstructor module 303 relates each of the transactions stored in Transaction Log 302A to the respective web page access to which such transaction corresponds, thereby forming web page session log 303A that comprises reconstructed client web page accesses. Once the web page accesses are reconstructed, end-to-end performance analysis (or other performance measurements) may, in certain implementations, be performed for such web page accesses in operational block 405 (shown in dashed-line in FIG. 4 as being optional in this example). Examples of performance analysis that may be performed on such reconstructed web page accesses are described further in U.S. patent application Ser. No. 10/146,967 entitled “SYSTEM AND METHOD FOR MEASURING WEB SERVICE PERFORMANCE USING CAPTURED NETWORK PACKETS”, the disclosure of which is incorporated herein by reference.

Thus, embodiments of the present invention capture HTTP transactions and relate such transactions to reconstruct client web page accesses (operational block 404 of FIG. 4). That is, embodiments of the present invention determine the one(s) of various captured HTTP transactions that correspond to a common web page access and group those transactions together. Once the various transactions that comprise a web page access are grouped together, those transactions may be evaluated to, for example, determine the client-perceived end-to-end performance of the web page access to which those transactions relate.

Turning to FIG. 5, a block diagram of reconstructing web page accesses from captured HTTP transactions is shown. As shown, Transaction Log 302A may include various HTTP transactions acquired for client accesses of a server (e.g., Transaction₁-Transaction₇ in the example of FIG. 5). Web page access reconstructor module 303 relates each of the transactions of Transaction Log 302A to the respective web page access to which such transaction corresponds, thereby forming web page session log 303A that comprises reconstructed client web page accesses. For instance, in the example of FIG. 5, Transaction₁, Transaction₃, and Transaction₄ are related (or grouped) together for a first web page access, Web Access₁. For example, Transaction₁ may be the transaction for downloading an HTML file for a web page to the client (such as HTML file 102A in the example of FIG. 1), and Transaction₃ and Transaction₄ may each be transactions for downloading embedded objects (e.g., images) for the web page (such as Object₁ and Object₂ in the example of FIG. 1).

Additionally, in the example of FIG. 5, Transaction₂ and Transaction₇ are related (or grouped) together for a second web page access, Web Access₂. For example, Transaction₂ may be the transaction for downloading an HTML file for a web page to the client and Transaction₇ may be a transaction for downloading an embedded object for the web page. Further, in the example of FIG. 5, Transaction₅ and Transaction₆ are related (or grouped) together for a third web page access, Web Access₃. For example, Transaction₅ may be the transaction for downloading an HTML file for a web page to the client and Transaction₆ may be a transaction for downloading an embedded object for the web page.

Once the web page accesses are reconstructed in this manner in web page session log 303A, end-to-end performance analysis (or other performance measurements) may be performed for such web page accesses (e.g., operational block 405 of FIG. 4). For instance, Transaction₁, Transaction₃, and Transaction₄ may be evaluated to determine the client-perceived end-to-end performance for Web Access₁ in the manner described further in U.S. patent application Ser. No. 10/146,967 entitled “SYSTEM AND METHOD FOR MEASURING WEB SERVICE PERFORMANCE USING CAPTURED NETWORK PACKETS”.

As described above in conjunction with FIG. 3, in a preferred embodiment request-response reconstructor module 302 reconstructs all TCP connections and generates a detailed HTTP Transaction Log 302A in which every HTTP transaction (a request and the corresponding response) has an entry. Each entry of the Transaction Log 302A is shown generically in the example of FIG. 5 as “Transaction_(N)”, wherein “N” is the entry number. Table 1 below describes in greater detail the format of an entry in HTTP Transaction Log 302A of a preferred embodiment.

TABLE 1

Field                        Value
URL                          The URL of the transaction.
Referer                      The value of the header field Referer, if it exists.
Content Type                 The value of the header field Content-Type in the responses.
Flow ID                      A unique identifier to specify the TCP connection of this transaction.
Source IP                    The client's IP address.
Request Length               The number of bytes of the HTTP request.
Response Length              The number of bytes of the HTTP response.
Content Length               The number of bytes of HTTP response body.
Request SYN timestamp        The timestamp of the SYN packet from the client.
Request Start timestamp      The timestamp for receipt of the first byte of the HTTP request.
Request End timestamp        The timestamp for receipt of the last byte of the HTTP request.
Start of Response            The timestamp when the first byte of response is sent by the server to the client.
End of Response              The timestamp when the last byte of response is sent by the server to the client.
ACK of Response Timestamp    The ACK packet from the client for the last byte of the HTTP response.
Response Status              The HTTP response status code.
Via Field                    Identification of whether the HTTP field Via is set.
Aborted                      Identification of whether the transaction is aborted.
Resent Request Packets       The number of packets resent by the client.
Resent Response Packet       The number of packets resent by the server.
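By way of illustration, an entry with the fields of Table 1 could be represented as a simple record. The following Python sketch is one possible layout, not the format used by the described embodiment; the class name `TransactionEntry`, the attribute names, and the chosen types (e.g., floating-point timestamps in seconds) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransactionEntry:
    """One HTTP transaction record mirroring the fields of Table 1
    (attribute names and types are assumptions of this sketch)."""
    url: str
    referer: Optional[str]        # None when the Referer header is absent
    content_type: str
    flow_id: int                  # identifies the TCP connection
    source_ip: str
    request_length: int           # bytes of the HTTP request
    response_length: int          # bytes of the HTTP response
    content_length: int           # bytes of the HTTP response body
    request_syn_ts: float
    request_start_ts: float
    request_end_ts: float
    response_start_ts: float
    response_end_ts: float
    response_ack_ts: float
    response_status: int
    via_set: bool
    aborted: bool
    resent_request_packets: int
    resent_response_packets: int
```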

The first field provided in the example Transaction Log entry of Table 1 is the URL field, which stores the URL for the HTTP transaction (e.g., the URL for the object being communicated to the client in such transaction). The next field in the entry is the Referer field. As described above with FIG. 1, typically when a web page is requested, an HTML file 102A is first sent to the client, such as a file “index.html”, which identifies the object(s) to be retrieved for the web page, such as Object₁ and Object₂ in the example of FIG. 1. When the objects for the requested web page (e.g., Object₁ and Object₂) are retrieved by the client via HTTP transactions (in the manner described above with FIG. 1), the Referer field identifies that those objects are embedded in (or are part of) the requested web page (e.g., the objects are associated with the index.html file in the above example). Accordingly, when transactions for downloading various different objects have the same Referer field, such objects belong to a common web page. The HTTP protocol defines such a Referer field, and therefore, the Referer field for a transaction may be taken directly from the captured Network Trace information for such transaction. More specifically, in the HTTP protocol, the referer request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained (i.e., the “referrer”, although the header field is misspelled). The referer request-header allows a server to generate lists of back-links to resources for interest, logging, optimized caching, etc. In view of the above, the Referer field of a transaction directly identifies the web page to which the object of such transaction corresponds.

However, not all HTTP requests for embedded objects contain referer fields. Accordingly, a preferred embodiment uses the Referer field as a main “clue” in reconstructing web page accesses. As described below, embodiments of the present invention utilize additional heuristics to accurately group transactions into their corresponding web page accesses.

The next field provided in the example entry of Table 1 is the Content Type field, which identifies the type of content downloaded in the transaction, such as “text/html” or “image/jpeg”, as examples. As described further below, the Content Type field may provide a clue as to whether a transaction is downloading an HTML file, such as “index.html”, for a web page, or whether it is downloading an object, such as an image, JAVASCRIPT, etc., that is embedded in a particular web page.

The next field in the entry is Flow ID, which is a unique identifier to specify the TCP connection of this transaction. The Flow ID may provide a further clue regarding whether a transaction is part of a given web page access being reconstructed. The next field in the entry is Source IP, which identifies the IP address of a client to which information is being downloaded in the transaction.

The next field in the example entry of Table 1 is the Request Length field, which identifies the number of bytes of the HTTP request of the transaction. Similarly, the Response Length field is included in the entry, which identifies the number of bytes of the HTTP response of the transaction. The Content Length field is also included, which identifies the number of bytes of the body of the HTTP response (e.g., the number of bytes of an object being downloaded to a client).

The next field in the example entry of Table 1 is the Request SYN timestamp, which is the timestamp of the SYN packet from the client. As described above, when a client clicks a hypertext link (or otherwise requests a URL) to retrieve a particular web page, the browser first establishes a TCP connection with the web server by sending a SYN packet. If the server is ready to process the request, it accepts the connection by sending back a second SYN packet acknowledging the client's SYN. Only after this connection is established can the true request for a web page be sent to the server. Accordingly, the Request SYN timestamp identifies when the first attempt to establish a connection occurred. This field may be used, for example, in determining the latency breakdown for a web page access to evaluate how long it took for the client to establish the connection with the server.

The next field in the entry is the Request Start timestamp, which is the timestamp for receipt of the first byte of the HTTP request of the transaction. Accordingly, this is the timestamp for the first byte of the HTTP request that is received once the TCP connection has been established with the server. The Request End timestamp is also included as a field in the entry, which is the timestamp for receipt of the last byte of the HTTP request of the transaction.

The next field in the entry is the Start of Response field, which identifies the timestamp when the first byte of the response is sent by the server to the client. The entry next includes an End of Response field, which identifies the timestamp when the last byte of the response is sent by the server to the client. The next field in the entry is ACK of Response timestamp, which is the timestamp of the ACK packet (acknowledge packet) from the client for the last byte of the HTTP response of the transaction. As described further below, the Request Start timestamp, Request End timestamp, and ACK of Response timestamp fields may be used in measuring the end-to-end performance perceived by the client for a web page access.

The next field in the example entry of Table 1 is the Response Status field, which is the HTTP response status code. For example, the response status code may be a “successful” indication (e.g., status code 200) or an “error” indication (e.g., status code 404). Typically, upon receiving a client's request for a web page (or object embedded therein), the web server provides a successful response (having status code 200), which indicates that the web server has the requested file and is downloading it to the client, as requested. However, if the web server cannot find the requested file, it may generate an error response (having status code 404), which indicates that the web server does not have the requested file.

The next field in the example entry of Table 1 is the Via field, which is typically set by a proxy of a client. If the client request is received by the server from a proxy, then the proxy typically adds its own information to the Via field. Thus, the Via field indicates that it is in fact not the original client that is making this request for the file, but rather a proxy acting on behalf of the client.

The next field in the example entry of Table 1 is the Aborted field, which indicates whether the current transaction was aborted. For example, the Aborted field may indicate whether the client's TCP connection for such transaction was aborted. Various techniques may be used to detect whether the client's TCP connection with the server and the current transaction, in particular, is aborted, such as those described further in U.S. patent application Ser. No. 10/146,988 entitled “SYSTEM AND METHOD FOR RELATING ABORTED CLIENT ACCESSES OF DATA TO QUALITY OF SERVICE PROVIDED BY A SERVER IN A CLIENT-SERVER NETWORK”, the disclosure of which is incorporated herein by reference.

The next field in the entry is the Resent Request Packets field, which provides the number of packets resent by the client in the transaction. The Resent Response Packet field is the final field in the entry, which provides the number of packets resent by the server in the transaction. These fields may provide information about the network status during the transaction. For instance, if it was necessary for the server to re-send multiple packets during the transaction, this may be a good indication that the network was very congested during the transaction.

Some fields of the HTTP Transaction Log entry may be used to rebuild web pages, as described further below, such as the URL, Referer, Content Type, Flow ID, Source IP, Request Start timestamp, and Response End timestamp fields. Other fields may be used to measure end-to-end performance for a web page access. For example, the Request Start timestamp and the Response End timestamp fields can be used together to calculate the end-to-end response time. The number of resent packets can reflect the network condition. The aborted connection field can reflect the quality of service, as described further in U.S. patent application Ser. No. 10/146,988 entitled “SYSTEM AND METHOD FOR RELATING ABORTED CLIENT ACCESSES OF DATA TO QUALITY OF SERVICE PROVIDED BY A SERVER IN A CLIENT-SERVER NETWORK”, the disclosure of which is incorporated herein by reference.
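As a simple illustration of how the Request Start and Response End timestamps might be combined, the following Python sketch computes a response time for a group of transactions belonging to one web page access. It assumes, for this example only, that the response time of a page is the interval from the earliest Request Start timestamp to the latest Response End timestamp among its transactions; the actual measurement techniques are those of the referenced performance-measurement application and may differ. The function name and dictionary keys are hypothetical.

```python
from typing import Dict, List


def page_response_time(transactions: List[Dict[str, float]]) -> float:
    """Return (latest Response End) - (earliest Request Start) for the
    transactions of one web page access. Keys are assumptions of this sketch."""
    if not transactions:
        raise ValueError("a web page access needs at least one transaction")
    start = min(t["request_start"] for t in transactions)
    end = max(t["response_end"] for t in transactions)
    return end - start
```

For a page with a single transaction, this reduces to the difference between that transaction's Response End and Request Start timestamps.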

As an example of network-level information that may be captured and used to populate certain of the above fields of Table 1, consider the following example requests and responses (transactions) for retrieving the “index.html” page with the embedded image “imgl.jpg” from a web server “www.hpl.hp.com”:

Transaction 1:
  Request:
    GET /index.html HTTP/1.0
    Host: www.hpl.hp.com
  Response:
    HTTP/1.0 200 OK
    Content-Type: text/html

Transaction 2:
  Request:
    GET /imgl.jpg HTTP/1.0
    Host: www.hpl.hp.com
    Referer: http://www.hpl.hp.com/index.html
  Response:
    HTTP/1.0 200 OK
    Content-Type: image/jpeg

In the above example, the first request is for the HTML file index.html. The content-type field in the corresponding response shows that it is an HTML file (i.e., content type of “text/html”). Then, the next request is for the embedded image imgl.jpg. The request header field referer indicates that the image is embedded in index.html. The corresponding response shows that the content type for this second transaction is an image in jpeg format (i.e., content type of “image/jpeg”). It should be noted that both of the transactions above have a status “200” (or “OK”) returned, which indicates that they were successful.
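To illustrate how header lines such as those in the example could be turned into log fields, the following Python sketch parses the header portion of an HTTP message into its start line and a dictionary of header fields. The function name `parse_headers` is hypothetical, and the sketch assumes well-formed headers without line folding; it is not presented as the parsing used by the described embodiment.

```python
from typing import Dict, Tuple


def parse_headers(message: str) -> Tuple[str, Dict[str, str]]:
    """Split an HTTP message head into its start line and header fields.
    Header names are lower-cased because they are case-insensitive."""
    lines = message.strip().splitlines()
    start_line = lines[0]
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers


# Usage with the second request of the example above.
request = (
    "GET /imgl.jpg HTTP/1.0\r\n"
    "Host: www.hpl.hp.com\r\n"
    "Referer: http://www.hpl.hp.com/index.html\r\n"
    "\r\n"
)
start_line, headers = parse_headers(request)
url = start_line.split()[1]          # "/imgl.jpg"
referer = headers.get("referer")     # None when the header is absent
```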

As the above example illustrates, the HTTP header field referer, when set, is a major clue that may be used for grouping objects into their corresponding web page. However, not all HTTP requests for embedded objects contain referer fields. Accordingly, a preferred embodiment utilizes additional heuristics to group objects into web pages, as described further below.

As described above with FIG. 3, in a preferred embodiment, web page access reconstructor module 303 is capable of constructing a Web Page Session Log 303A from the Transaction Log 302A. In order to measure the client-perceived end-to-end response time for a web page download, it becomes desirable to identify which objects are embedded in a particular web page, or in other words, which HTTP transactions correspond to a given web page access. The response time for client requests retrieving such embedded images from a web server may then be measured. In other words, to measure the client-perceived end-to-end response time, it becomes desirable to group the HTTP transactions into their respective web page accesses (as shown in the example of FIG. 5). Once the HTTP transactions that comprise a particular web page access are determined, the client-perceived end-to-end response time for that particular web page can be measured.

Although some embedded objects of a web page can be determined by parsing the HTML file using HTML syntax (e.g., for IMG or OBJECT elements), some other embedded objects cannot be easily discovered by statically parsing and interpreting HTML syntax. For example, JAVASCRIPT is popularly used in web pages to generate some special visual effects. When a browser executes this type of JAVASCRIPT code, the JAVASCRIPT often needs to first download some images. It is difficult to discover these implicitly embedded images without executing the JAVASCRIPT.

One technique that may be used for detecting web page accesses uses the client's think time as the delimiter between two web page accesses. See, e.g., F. Smith, F. Campos, K. Jeffay, and D. Ott, “What TCP/IP Protocol Headers Can Tell Us About the Web”, In Proceedings of ACM SIGMETRICS, Cambridge, May, 2001, the disclosure of which is hereby incorporated herein by reference. This method is simple and useful. However, it might be inaccurate in many cases. For example, suppose a client opens two web pages from one server at the same time; in this case, the requests for two different web pages interleave each other without any think time between them, and therefore the client think time is not a useful delimiter for that case. As another example, the interval between the requests for the objects within one page is often too long to be distinguishable from client think time (perhaps because of the network conditions), thus making it difficult to delimit web page accesses based on client think time.
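For comparison purposes only, the think-time delimiter idea may be sketched in Python as follows. This illustrates the prior technique discussed above, not the grouping used by the preferred embodiment. The function name, the use of gaps between consecutive request times as a stand-in for think time, and the threshold value are all assumptions of this sketch.

```python
from typing import List, Sequence


def split_by_think_time(request_times: Sequence[float],
                        think_time_threshold: float) -> List[List[float]]:
    """Split one client's request timestamps into page accesses, starting a
    new access whenever the gap to the previous request exceeds the threshold."""
    groups: List[List[float]] = []
    for t in sorted(request_times):
        if groups and t - groups[-1][-1] <= think_time_threshold:
            groups[-1].append(t)   # short gap: same page access
        else:
            groups.append([t])     # long gap (or first request): new access
    return groups
```

The weaknesses noted above are visible in this form: requests for two pages opened at the same time fall into a single group, and a slow object retrieval whose gap exceeds the threshold is split off as a separate access.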

Unlike techniques of the prior art, such as the above-described active probing and web page instrumentation techniques, embodiments of the present invention provide a technique for passively reconstructing web page accesses from captured network-level information. Further, certain embodiments of the present invention utilize heuristics to determine the objects composing the web page (in other words: the content of a web page) and apply statistics to adjust the results, as described further below.

As described above, the request-response reconstructor module 302 reconstructs all TCP connections and generates a detailed HTTP Transaction Log 302A in which every HTTP transaction (a request and the corresponding response) has an entry. Table 1 above describes the format of an entry in the HTTP Transaction Log 302A of a preferred embodiment. As described above with FIG. 4, the operational flow of a preferred embodiment relates HTTP transactions from Transaction Log 302A to their corresponding web page accesses (see block 404 of FIG. 4).

To group the transactions into web page accesses in a preferred embodiment, the following fields from the entries in Transaction Log 302A are used:

-   request URL,
-   request Referer field,
-   request Content Type,
-   response Status Code,
-   request Via header field,
-   Flow ID (TCP connection ID),
-   Source (client) IP address,
-   Request Start timestamp, and
-   Response End timestamp.

FIG. 6 shows an example operational flow diagram for relating the transactions of Transaction Log 302A to web page accesses in accordance with a preferred embodiment of the present invention. In operational block 601, attention is directed to the first entry of Transaction Log 302A. In block 602, it is determined whether the entry being considered has a successful response type. That is, the Status Code field for the entry is evaluated to determine whether the transaction for this entry was successful (e.g., having status code “200” sent in the response from the web server). If it is determined in block 602 that the transaction was not successful, then execution advances to block 604 and the entry is excluded from further consideration (e.g., removed from Transaction Log 302A).

If it is determined in block 602 that the transaction was successful, then execution advances to block 603 whereat a determination is made as to whether this entry is for a transaction from a web proxy. That is, the Via field for the entry is evaluated to determine whether the transaction for this entry was sent from a web proxy. According to HTTP/1.1, the via header field is to be used by gateways and proxies to indicate the intermediate protocols and recipients between the client and the server. If it is determined in block 603 that the transaction was from a proxy, then execution advances to block 604 and the entry is excluded from further consideration (e.g., removed from Transaction Log 302A).

Thereafter, execution advances to block 605 and a determination is made as to whether more entries exist in Transaction Log 302A. If more entries do exist, then attention is directed to the next entry of Transaction Log 302A at block 606, and execution returns to block 602 to evaluate whether this next entry was successful and whether it was from a web proxy. Accordingly, a preferred embodiment analyzes the transactions collected in Transaction Log 302A to exclude transactions that were unsuccessful, as well as transactions from web proxies, from further consideration in reconstructing client web page accesses.

Once it is determined at block 605 that no more entries exist in Transaction Log 302A (i.e., all entries have been examined to detect unsuccessful transactions and transactions issued by web proxies), the entries are sorted by their Request Start timestamp field (in increasing time order) in operational block 607. Thus, once this sorting is completed, the requests for the embedded objects of a web page must follow the request for the corresponding HTML file. That is, it is known after the sorting that any requests for embedded objects of a web page are arranged in Transaction Log 302A after their corresponding HTML file. Of course, after such sorting, the transactions from different clients may still be interleaved with each other in Transaction Log 302A. Even for one client, the transactions for a single web page may be interleaved with the transactions for other pages.
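The filtering and sorting of blocks 601-607 may be sketched in Python as follows. This is a minimal illustration under stated assumptions: each log entry is a dictionary with hypothetical keys `status`, `via`, and `request_start`, and a transaction is treated as successful only when its status code is 200, which is the example given above rather than a complete definition of success.

```python
from typing import Any, Dict, List

Entry = Dict[str, Any]


def prepare_log(entries: List[Entry]) -> List[Entry]:
    """Exclude unsuccessful transactions and transactions sent via a proxy,
    then sort the remaining entries by Request Start timestamp."""
    kept = [
        e for e in entries
        if e["status"] == 200   # block 602: successful response (assumed: 200 only)
        and not e["via"]        # block 603: Via field not set
    ]
    # Block 607: increasing Request Start time, so that a request for an
    # embedded object follows the request for its HTML file.
    return sorted(kept, key=lambda e: e["request_start"])
```

Because the sketch builds a new list, the original log is left unmodified; removing entries in place would be an equally valid reading of block 604.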

Accordingly, in operational block 608, the sorted Transaction Log 302A is scanned in order to group the transactions into their corresponding web page accesses. To perform this grouping of transactions into their corresponding web page accesses, in a preferred embodiment, the transactions from Transaction Log 302A are organized into a hash table, such as the example hash table shown in FIG. 7. The example hash table of FIG. 7 maps a client's IP address to a web page table containing all web pages accessed by the client (this data structure is referred to herein as Web Page Session Log 303A). Each entry in the web page table is composed of an HTML file and some embedded objects, or a single, independent object that does not include any further embedded objects therein.

For example, as shown in FIG. 7, Web Page Session Log 303A includes client IP addresses 701, such as IP₁, IP₂, IP₃, . . . , IP_(n). Each client's IP address is mapped to a web page table that identifies all web pages accessed by the client in the transactions of Transaction Log 302A. For instance, in the example of FIG. 7, client address IP₁ is mapped to web page table 702 that identifies the web pages accessed by such client. More specifically, in this example, web page table 702 comprises three web pages that have been accessed by client IP₁. The first web page accessed by IP₁ comprises HTML file HTML₁ with embedded objects 703A-703C. The second web page accessed by IP₁ comprises HTML file HTML₂ with embedded object 704A, and the third web page accessed by IP₁ comprises HTML file HTML₃ with embedded objects 705A and 705B.
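The structure of FIG. 7 may be pictured with the following Python sketch, offered only as one possible in-memory layout. The `WebPage` class and the `session_log` variable are hypothetical names; the reference numerals are used as stand-ins for object URLs.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WebPage:
    html_file: str                                      # HTML file, or a single independent object
    embedded: List[str] = field(default_factory=list)   # embedded objects, possibly none


# Hash table keyed by client IP address; each value is that client's web page table.
session_log: Dict[str, List[WebPage]] = defaultdict(list)

# Web page table 702 for client IP1, as in the example of FIG. 7.
session_log["IP1"].append(WebPage("HTML1", ["703A", "703B", "703C"]))
session_log["IP1"].append(WebPage("HTML2", ["704A"]))
session_log["IP1"].append(WebPage("HTML3", ["705A", "705B"]))
```

A dictionary keyed by IP address gives constant-time lookup of a client's web page table, which is the lookup performed for every transaction during grouping.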

The operation of grouping the web page objects (or transactions) into corresponding client web page accesses in order to form Web Page Session Log 303A in accordance with a preferred embodiment is now described in conjunction with the example operational flow shown in FIG. 8. As described above with operational block 608 of FIG. 6, in a preferred embodiment, web page access reconstructor module 303 groups transactions from Transaction Log 302A into web page accesses (thereby forming Web Page Session Log 303A). FIG. 8 shows operational block 608 of FIG. 6 in further detail. The operation of FIG. 8 is executed for each entry of Transaction Log 302A in order to group such transactions into corresponding web page accesses. In processing an entry of Transaction Log 302A, web page access reconstructor module 303 first locates a hash entry for the client's IP address. That is, the Source IP field for the entry (see Table 1) may be used to identify the client's IP address, which is used to locate the corresponding hash table entry (e.g., in client IP addresses 701 of FIG. 7).

In a preferred embodiment, web page access reconstructor module 303 handles the transaction differently depending on its content type. That is, web page access reconstructor module 303 evaluates the transaction's content type in determining a web page access to which the transaction corresponds. For instance, in the example operational flow of FIG. 8, web page access reconstructor module 303 determines whether the transaction's content type is “text/html” in block 802. The transaction's content type is indicated by the Content Type field in Transaction Log 302A. If it is determined that the transaction's content type is text/html, web page access reconstructor module 303 treats it as the beginning of a web page. Accordingly, at block 803, web page access reconstructor module 303 creates a new web page entry in the web page table (e.g., table 702 of FIG. 7) for the client's IP address in Web Page Session Log 303A. That is, the transaction being evaluated is identified in Web Page Session Log 303A as being the HTML file for a new web page access for the client.

Some content types are well-known as independent objects that cannot contain any embedded objects, such as content types: application/postscript, application/x-tar, application/pdf, application/zip, and text/plain. Accordingly, if it is determined in block 802 that the transaction's content type is not “text/html”, execution advances to block 804 whereat it is determined whether the transaction's content type indicates that the transaction was for an independent object that does not include embedded objects. If it is determined that the transaction's content type indicates that the transaction was for an independent object (e.g., the transaction's content type is application/postscript, application/x-tar, application/pdf, application/zip, or text/plain, as examples), then execution advances to block 805 whereat web page access reconstructor module 303 marks the transaction as being for an individual web page that does not include any embedded objects. Web page access reconstructor module 303 further allocates a new web page entry in the client's web page table in Web Page Session Log 303A for this transaction.

As for transactions having content types other than “text/html” and other than an independent object type, it is determined that such transactions may involve objects that are embedded in a web page. Accordingly, web page access reconstructor module 303 attempts to associate such objects with the corresponding web page in which they are embedded. One technique for relating the transaction for an embedded object to its corresponding web page utilizes the Referer field for the transaction. Thus, if it is determined at block 804 that the transaction's content type does not indicate that the object is an independent object, execution advances to block 806 whereat it is determined whether the Referer field is set for the transaction. If the Referer header field is set for this transaction, web page access reconstructor module 303 can utilize this information in determining which web page the object belongs to.

If the Referer field is set for the transaction, execution advances to block 807 to determine whether the referred HTML file is an existing entry in the web page table for this client. That is, it is determined whether the value of the transaction's Referer field identifies an existing HTML file entry in the client's web page table. If the referred HTML file is an existing entry in the web page table, then web page access reconstructor module 303 appends this transaction to such existing web page entry in the client's web page table in block 808. However, if the referred HTML file is not an existing entry in the client's web page table, it means that the client accessed this web page without accessing the HTML file, which may be cached somewhere between the client and the web server. In this case, web page access reconstructor module 303 creates a new entry in the client's web page table for the referred HTML file and marks it as nonexistent. Web page access reconstructor module 303 then appends the considered object to the newly created web page entry in the client's web page table.

If it is determined at block 806 that the Referer header field is not set for this transaction, execution advances to block 810 whereat web page access reconstructor module 303 searches the client's web page table for a web page accessed via the same Flow ID as this transaction. That is, the Flow ID field for this transaction in Transaction Log 302A is compared against the Flow IDs for entries in the client's web page table to determine whether a web page entry in the client's web page table has the same Flow ID as this transaction. At block 811, web page access reconstructor module 303 determines whether a web page entry having the same Flow ID as this transaction is found in the client's web page table. If a web page entry in the client's web page table has the same Flow ID as the transaction under consideration, there is a high likelihood that the transaction under consideration corresponds to the web page entry having the same Flow ID.

To improve the results of this technique, web page access reconstructor module 303 of a preferred embodiment also adopts a configurable “think time threshold” to delimit web pages. If the time gap between the transaction and the tail of the web page to which it tries to append is larger than the think time threshold, web page access reconstructor module 303 does not append the transaction to the web page and eliminates this transaction from further consideration. Accordingly, in block 812, it is determined whether the time gap between the transaction and the tail of the web page that has the same Flow ID as the transaction under consideration exceeds the think time threshold. If the think time threshold is not exceeded, then the transaction is appended to the found web page entry having the same Flow ID at block 813. That is, it is determined that this transaction is part of the web page access that has the same Flow ID.

Otherwise, if it is determined at block 812 that the think time threshold is exceeded or if it is determined at block 811 that a web page having the same Flow ID as the transaction under consideration is not found, execution advances to block 814 whereat web page access reconstructor module 303 determines the latest accessed web page in the client's web page table. That is, web page access reconstructor module 303 determines the web page in the client's web page table having the closest Request End timestamp to that of the transaction being examined. In a preferred embodiment, a think time threshold is again used in operational block 815 to determine whether it is appropriate to append the transaction to the latest accessed web page. If the time gap between the transaction and the tail of the web page to which it tries to append is larger than the think time threshold, web page access reconstructor module 303 does not append the transaction to the web page and eliminates this transaction from further consideration. Accordingly, in block 815, it is determined whether the time gap between the transaction and the tail of the web page exceeds the think time threshold. If the think time threshold is not exceeded, then the transaction is appended to the latest accessed web page in the client's web page table in operational block 816. Otherwise, this transaction is dropped from further consideration at operational block 817. That is, web page access reconstructor module 303 does not utilize this entry in constructing Web Page Session Log 303A.
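The operational flow of FIG. 8 described above may be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the class names `Txn` and `Page`, the function `group`, and all field names are hypothetical; the Referer value is assumed to be recorded in the same form as the logged URL; the "time gap" is assumed to be the transaction's Request Start minus the latest Request End already in the page; and independent objects are assumed not to be candidates for receiving embedded objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

INDEPENDENT_TYPES = {
    "application/postscript", "application/x-tar", "application/pdf",
    "application/zip", "text/plain",
}


@dataclass
class Txn:
    source_ip: str
    url: str
    content_type: str
    request_start: float
    request_end: float
    flow_id: int
    referer: Optional[str] = None


@dataclass
class Page:
    url: str
    txns: List[Txn] = field(default_factory=list)
    html_seen: bool = True       # False: entry created from a Referer only
    standalone: bool = False     # True: independent object, no embedded objects

    def tail_end(self) -> float:
        return max(t.request_end for t in self.txns)


def within(page: Page, t: Txn, think_time: float) -> bool:
    # Assumed definition of the time gap between a transaction and a page's tail.
    return t.request_start - page.tail_end() <= think_time


def group(log: List[Txn], think_time: float) -> Dict[str, List[Page]]:
    table: Dict[str, List[Page]] = {}
    for t in log:                                     # log: filtered, sorted by start
        pages = table.setdefault(t.source_ip, [])     # locate the client's hash entry
        if t.content_type == "text/html":             # blocks 802-803
            pages.append(Page(t.url, [t]))
        elif t.content_type in INDEPENDENT_TYPES:     # blocks 804-805
            pages.append(Page(t.url, [t], standalone=True))
        elif t.referer is not None:                   # blocks 806-808
            page = next((p for p in reversed(pages)
                         if p.url == t.referer and not p.standalone), None)
            if page is None:                          # HTML file never seen (cached)
                page = Page(t.referer, [], html_seen=False)
                pages.append(page)
            page.txns.append(t)
        else:                                         # blocks 810-817
            candidates = [p for p in pages if not p.standalone]
            same_flow = [p for p in candidates
                         if any(x.flow_id == t.flow_id for x in p.txns)]
            target = None
            if same_flow and within(same_flow[-1], t, think_time):
                target = same_flow[-1]                # block 813
            elif candidates:
                latest = max(candidates, key=Page.tail_end)
                if within(latest, t, think_time):     # block 815
                    target = latest                   # block 816
            if target is not None:
                target.txns.append(t)
            # otherwise the transaction is dropped (block 817)
    return table


# Hypothetical sorted log for two clients.
log = [
    Txn("IP1", "/index.html", "text/html", 0.0, 0.1, 1),
    Txn("IP2", "/b.gif", "image/gif", 0.1, 0.2, 9, referer="/other.html"),
    Txn("IP1", "/img1.gif", "image/gif", 0.2, 0.3, 1, referer="/index.html"),
    Txn("IP1", "/img2.gif", "image/gif", 0.4, 0.5, 1),       # no Referer, same flow
    Txn("IP1", "/doc.pdf", "application/pdf", 1.0, 1.5, 2),
    Txn("IP1", "/late.gif", "image/gif", 100.0, 100.1, 3),   # beyond think time
]
table = group(log, think_time=4.0)
```

In this run, `/img2.gif` is attached to `/index.html` through its Flow ID, `/doc.pdf` forms its own entry, `/late.gif` is dropped, and the entry for `/other.html` is created with `html_seen` set to `False` because only the Referer revealed it.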

The heuristic approach for reconstructing client web page accesses described above with FIG. 8 is not guaranteed to always correctly reconstruct web page accesses. However, certain methods may be employed to improve the results of this approach (i.e., to reduce the percentage of incorrectly reconstructed web page accesses). In a preferred embodiment, statistical analysis of observed web page access patterns is utilized to more exactly determine the content of the web pages accessed, which helps to improve the results of reconstructed web page accesses. For instance, the content of web pages may be determined, which may aid in determining whether the appropriate transactions are correctly grouped together for reconstructing a web page access.

Accordingly, once the hash table of web page accesses (or “Web Page Session Log 303A”) is generated from the operational flow of FIG. 8, web page access reconstructor module 303 of a preferred embodiment utilizes such hash table to attempt to determine the content of any given web page in the manner described hereafter in conjunction with the example operational flow diagram of FIG. 9. As shown in FIG. 9, web page access reconstructor module 303 first collects all possible access patterns for a given web page from the hash table (or Web Page Session Log 303A) in operational block 901. In operational block 902, web page access reconstructor module 303 identifies the probable content of a web page as being the combined set of all objects that appear in different access patterns for such web page.

Thus, web page access reconstructor module 303 scans the web page hash table of FIG. 7 and creates a new hash table mapping from URLs to the objects embedded in the URLs. Table 2 below provides an example hash table mapping of URLs to the objects embedded in the URLs for a particular web page identified in Web Page Session Log 303A.

TABLE 2
Web page Probable Content table (there are 3075 accesses for this page).

Index   URL           Frequency   Ratio (%)
1       /index.html   2937        95.51
2       /img1.gif      689        22.41
3       /img2.gif      641        20.85
4       /log1.gif        1         0.03
5       /log2.gif        1         0.03

In creating the new hash table mapping, such as the example of Table 2, web page access reconstructor module 303 assigns an index for each object that appears in the access patterns for a web page. The column URL of Table 2 identifies the URLs of the objects that appear in the access patterns for the web page. In operational block 903, web page access reconstructor module 303 computes the frequency at which each object occurs in all accesses of the web page being considered. That is, web page access reconstructor module 303 maintains a count for each object grouped for the web page under consideration in the access patterns. The column Frequency of Table 2 identifies the frequency of an object in the set of all web page accesses in the hash table.

Table 2 above shows an example of a new hash table built as a probable content of a given web page identified from the web page accesses. In Table 2, the indices are sorted by the frequencies of the occurrences of the objects. The column Ratio is the percentage of the object's accesses in the total accesses for the page.
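As an illustration of how the Frequency and Ratio columns of Table 2 may be derived, the following Python sketch counts object occurrences over a set of observed accesses. The function name `probable_content` and the dictionary `TABLE3_PATTERNS` are hypothetical; the access counts themselves are taken from the pattern frequencies given in Table 3 of this description.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def probable_content(patterns: Dict[FrozenSet[str], int]) -> List[Tuple[str, int, float]]:
    """Return (URL, frequency, ratio in percent) rows, most frequent first."""
    total = sum(patterns.values())          # total accesses for the page
    freq: Counter = Counter()
    for objects, count in patterns.items():
        for url in objects:
            freq[url] += count              # one count per access containing the object
    rows = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(url, n, round(100.0 * n / total, 2)) for url, n in rows]


# Observed combinations of objects and how often each was seen (3075 accesses).
TABLE3_PATTERNS: Dict[FrozenSet[str], int] = {
    frozenset({"/index.html"}): 2271,
    frozenset({"/index.html", "/img1.gif", "/img2.gif"}): 475,
    frozenset({"/index.html", "/img1.gif"}): 113,
    frozenset({"/index.html", "/img2.gif"}): 76,
    frozenset({"/img1.gif", "/img2.gif"}): 51,
    frozenset({"/img1.gif"}): 49,
    frozenset({"/img2.gif"}): 38,
    frozenset({"/index.html", "/img1.gif", "/log1.gif"}): 1,
    frozenset({"/index.html", "/img2.gif", "/log2.gif"}): 1,
}

rows = probable_content(TABLE3_PATTERNS)
```

With these counts, `rows` reproduces the five rows of Table 2, beginning with `("/index.html", 2937, 95.51)`.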

Before computing the statistics of the access patterns for web pages, web page access reconstructor module 303 of a preferred embodiment also attempts, in operational block 904, to merge the requests for the same page with different URL expressions. For example, the following URLs could point to the same web page:

-   (a) http://www.hpl.hp.com
-   (b) http://www.hpl.hp.com/
-   (c) http://www.hpl.hp.com/index.htm
-   (d) http://www.hpl.hp.com/index.html

In a preferred embodiment, web page access reconstructor module 303 merges the accesses for URLs that point to the same web page. Web page access reconstructor module 303 may use the probable content of these URLs to determine whether they indicate the same web page. As discussed in the following paragraph, the heuristic method for grouping objects into web page accesses may cause some objects to be grouped by mistake. So, even if two URLs are for the same web page, their probable content may include different objects. However, the proportion of the accesses to these objects should be relatively small (e.g., below a certain threshold, such as 1%), and therefore web page access reconstructor module 303 may ignore them when merging URLs.
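One possible way to compare the probable content of two URLs is sketched below in Python. The description does not specify the comparison, so the rule used here is an assumption made for illustration: two URLs are treated as the same web page when their embedded objects, after ignoring objects whose ratio falls below the threshold and ignoring each page's own URL, are identical and non-empty. The function names and the sample ratios for the first URL are hypothetical.

```python
from typing import Dict, Set


def significant_objects(content: Dict[str, float], page_url: str,
                        min_ratio: float = 1.0) -> Set[str]:
    """Embedded objects whose access ratio (in percent) reaches min_ratio."""
    return {url for url, ratio in content.items()
            if url != page_url and ratio >= min_ratio}


def same_page(url_a: str, content_a: Dict[str, float],
              url_b: str, content_b: Dict[str, float],
              min_ratio: float = 1.0) -> bool:
    a = significant_objects(content_a, url_a, min_ratio)
    b = significant_objects(content_b, url_b, min_ratio)
    # Assumption: pages without significant embedded objects are never merged,
    # since an empty set carries no evidence that the pages coincide.
    return bool(a) and a == b


# Probable content (URL -> ratio in percent) for two URL expressions.
root = {"/": 96.0, "/img1.gif": 22.0, "/img2.gif": 21.0, "/log1.gif": 0.03}
index = {"/index.html": 95.51, "/img1.gif": 22.41, "/img2.gif": 20.85}
other = {"/about.html": 99.0, "/team.gif": 40.0}

merge_root_and_index = same_page("/", root, "/index.html", index)
merge_root_and_other = same_page("/", root, "/about.html", other)
```

Here the rarely seen `/log1.gif` is ignored, so `/` and `/index.html` are judged to be the same page, while `/about.html` is not merged with either.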

The heuristic methods used by web page access reconstructor module 303 for grouping transactions into web page accesses may introduce some inaccuracy, and some access patterns collected by web page access reconstructor module 303 may include objects that in reality do not belong to the web page. To adjust/improve the grouping results of web page access reconstructor module 303, statistics are used in a preferred embodiment to more precisely determine the content of web pages from the page access patterns. More specifically, based on a web page's probable content table, as demonstrated in Table 2, web page access reconstructor module 303 uses the indices of objects in Table 2 to describe the access patterns for a given web page.

Table 3 below provides an example set of different combinations of observed access patterns for the web page considered in the example of Table 2. Each row of Table 3 is an access pattern. The column Object Indices shows the indices of the objects accessed in a pattern. In operational block 905, web page access reconstructor module 303 computes statistics for the various observed access patterns for a web page, such as the Frequency and Ratio statistics of Table 3. The columns Frequency and Ratio are the number of accesses and the percentage of a pattern in the total number of all accesses for the web page. For example, pattern 1 is a pattern in which only the object /index.html is accessed. It is the most popular access pattern for this web page. In the example of Table 3, 2,271 accesses out of the total 3,075 accesses represent this pattern. In pattern 2, the objects /index.html, /img1.gif and /img2.gif are accessed.

TABLE 3
Web page access patterns (with a total of 3,075 accesses)

Index   Object Indices   Frequency   Ratio (%)
1       1                2271        73.85
2       1, 2, 3           475        15.45
3       1, 2              113         3.67
4       1, 3               76         2.47
5       2, 3               51         1.66
6       2                  49         1.59
7       3                  38         1.24
8       1, 2, 4             1         0.03
9       1, 3, 5             1         0.03

For any given web page, web page access reconstructor module 303 of a preferred embodiment attempts to further estimate the web page's true content with the help of the statistics for the web page access patterns. Intuitively, if an access pattern is collected by web page access reconstructor module 303 by mistake, it is unlikely for this pattern to appear many times. For example, patterns 8 and 9 in the example of Table 3 each occur only once. This might be caused, for example, by a proxy that accessed the objects consecutively in a very short time period without the Via field being set in the request header. Another reason for this mistake can be that web page access reconstructor module 303 may attach objects that do not have a Referer field to a wrong web page if these objects were accessed within a reasonably short time after this web page (as described in the example flow diagram of FIG. 8).

In a preferred embodiment, web page access reconstructor module 303 uses a configurable ratio threshold to exclude the mistaken patterns, e.g., a ratio threshold of 1%. If the ratio of a pattern is below the threshold, web page access reconstructor module 303 does not consider it as a valid pattern. In practice, the threshold can be adjusted according to the number of accesses for the web page. For instance, if the number of accesses is small, a large threshold is preferably adopted since the mistaken patterns occupy a much larger ratio in this case. So, patterns 8 and 9 in the example of Table 3 are not considered as valid access patterns.

Accordingly, in operational block 906 of FIG. 9, attention is directed to the first access pattern observed in the hash table for the web page under consideration. A determination is made at block 907 whether the pattern's ratio is below a predefined ratio threshold. If it is determined that the pattern's ratio is below the threshold, then it is determined, at block 908, that this is an invalid pattern. Otherwise, it is determined, at block 909, that the pattern under consideration is valid. At block 910 it is determined whether more access patterns for the web page under consideration exist in the hash table. If it is determined that more such access patterns do exist, then attention is directed to the next access pattern for the web page at block 911, and execution then returns to block 907 to determine whether this next access pattern is valid.

Once it is determined at block 910 that no further access patterns exist in the hash table for the web page under consideration, execution advances to block 912 whereat the object(s) of valid access patterns are determined to be the true content of the web page under consideration. That is, only the objects found in the valid access patterns are considered as being the true embedded objects of a given web page. Thus, objects having indices 1, 2, and 3 in the above example of Table 3 are determined to define the true content of the web page under consideration, as shown in Table 4 below.

TABLE 4
Web page's true content (with a total of 3075 accesses)

Index   URL           Frequency   Ratio (%)
1       /index.html   2937        95.51
2       /img1.gif      689        22.41
3       /img2.gif      641        20.85

A web page access from Web Page Session Log 303A is considered to be valid if all the objects from the page's accesses are contained in the true content for a given web page. Accordingly, the reconstructed web page accesses of Web Page Session Log 303A that are determined to be valid are considered to be reliable/accurate. Thus, the valid reconstructed web page accesses provide great accuracy in representing client web page accesses, which may be used, for example, in evaluating client-perceived end-to-end performance in certain embodiments. In a preferred embodiment, the end-to-end response time statistics may be computed using the web page accesses from Web Page Session Log 303A that are determined to be valid.
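The pattern filtering of blocks 906 through 912 and the validity test just described may be sketched in Python as follows. The names `true_content`, `is_valid_access`, and `TABLE3_PATTERNS` are hypothetical; the pattern counts are those of Table 3, and the default threshold of 1% is the example value given above.

```python
from typing import Dict, FrozenSet, Iterable, Set


def true_content(patterns: Dict[FrozenSet[str], int],
                 ratio_threshold: float = 1.0) -> Set[str]:
    """Union of the objects of all patterns whose ratio reaches the threshold."""
    total = sum(patterns.values())
    content: Set[str] = set()
    for objects, count in patterns.items():
        ratio = 100.0 * count / total
        if ratio >= ratio_threshold:        # blocks 907-909: pattern is valid
            content |= objects              # block 912: contributes to true content
    return content


def is_valid_access(objects: Iterable[str], content: Set[str]) -> bool:
    """An access is valid if every object in it belongs to the true content."""
    return set(objects) <= content


TABLE3_PATTERNS: Dict[FrozenSet[str], int] = {
    frozenset({"/index.html"}): 2271,
    frozenset({"/index.html", "/img1.gif", "/img2.gif"}): 475,
    frozenset({"/index.html", "/img1.gif"}): 113,
    frozenset({"/index.html", "/img2.gif"}): 76,
    frozenset({"/img1.gif", "/img2.gif"}): 51,
    frozenset({"/img1.gif"}): 49,
    frozenset({"/img2.gif"}): 38,
    frozenset({"/index.html", "/img1.gif", "/log1.gif"}): 1,   # pattern 8
    frozenset({"/index.html", "/img2.gif", "/log2.gif"}): 1,   # pattern 9
}

content = true_content(TABLE3_PATTERNS)
```

With the 1% threshold, patterns 8 and 9 are rejected, `content` holds exactly the three objects of Table 4, and an access that includes `/log1.gif` is therefore not counted as valid.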

A first typical metric of interest for service providers is the average end-to-end time observed by the clients when downloading a given page during a particular time interval. A more advanced analysis of the end-to-end time includes computation of the whole distribution of end-to-end times observed by clients when downloading a given page during a particular time interval. Such distribution allows, for example, the following to be computed: A) a percentage of clients with a “good site experience” (i.e., their observed response time was below a predefined targeted threshold), and B) a percentage of clients with a “poor site experience” (i.e., their observed response time was above the targeted threshold).

In a preferred embodiment, such performance metrics may be computed for a given web page. For example, the web page hash table of FIG. 7 may be scanned, and the end-to-end response time may be computed for entries in the table whose objects are in the subset of the true content of the web page under consideration. The end-to-end response time for a given web page access is defined as a difference between the earliest and latest timestamps among the page related objects in the corresponding web page hash table entry. Various other performance metrics may be computed for the valid reconstructed web page accesses to provide the service provider with an understanding of the client-perceived performance of the services (e.g., web page accesses).
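The response time definition and the metrics discussed above may be illustrated with the following Python sketch. The function names are hypothetical, and two assumptions are made: the earliest timestamp is taken to be the smallest Request Start and the latest timestamp the largest Request End among the page's objects, and a response time exactly equal to the targeted threshold is counted as a poor experience.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def response_time(timestamps: Iterable[Tuple[float, float]]) -> float:
    """timestamps: (request_start, request_end) for each object of one access."""
    ts = list(timestamps)
    earliest = min(start for start, _ in ts)
    latest = max(end for _, end in ts)
    return latest - earliest


def summarize(times: List[float], target: float) -> Dict[str, float]:
    """Average response time and percentage of good and poor experiences."""
    good = sum(1 for t in times if t < target)      # below the targeted threshold
    n = len(times)
    return {
        "average": mean(times),
        "good_pct": 100.0 * good / n,
        "poor_pct": 100.0 * (n - good) / n,
    }


# One valid access: the HTML file followed by two embedded objects.
one_access = response_time([(0.0, 0.4), (0.5, 1.2), (0.6, 0.9)])

# Hypothetical response times, in seconds, for four valid accesses of a page.
report = summarize([1.0, 2.0, 3.0, 6.0], target=2.5)
```

For the sample access, the response time spans from the first request start (0.0) to the last response end (1.2); in the sample report, half of the accesses fall below the 2.5 second target.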

It should be understood that the modules of FIG. 3 for reconstructing web page accesses may be deployed in several different ways on the server side of a client-server network. As used herein, the “server side” of a client-server network is not intended to be limited solely to the server itself, but is also intended to comprise any point in the client-server network at which all of the traffic “to” and “from” the server (e.g., a web server cluster or a particular web server in a cluster) that is used to support the monitored web site (or other type of monitored information that is accessible by a client) can be observed (e.g., to enable capture of the network packets communicated to/from the server). Various examples of server-side implementations are described herein below. As one example, the modules may be implemented as an independent network appliance for reconstructing web page accesses (and, in certain implementations, measuring end-to-end performance). An example of such a network appliance implementation is shown in FIG. 10. As shown, one or more servers 101 (e.g., servers 101A-101D of FIG. 1) may be provided for serving information (e.g., web pages) to one or more clients 104 (e.g., clients 104A-104D of FIG. 1) via communication network 103. Web page access reconstructor appliance 1000 may be arranged at a point in communication network 103 where it can capture all HTTP transactions for server(s) 101, e.g., the same subnet of server(s) 101. In this implementation, web page access reconstructor appliance 1000 should be arranged at a point in network 103 where traffic in both directions can be captured for server(s) 101: the request traffic to server(s) 101, and the response traffic from server(s) 101. Thus, if a web site consists of multiple web servers 101, web page access reconstructor appliance 1000 should be placed at a common entrance and exit of all such web servers 101.

If a web site is supported by geographically distributed web servers, such a common point may not exist in network 103. However, most typically, web servers in a web server farm (or cluster) use “sticky connections”, i.e., once the client has established a TCP connection with a particular web server, the client's subsequent requests are sent to the same server. In this case, web page access reconstructor appliance 1000 can still be used to capture a flow of transactions to and from a particular web server 101, representing a part of all web transactions for the web site, and the measured data can be considered as sampling.

As another example of how the modules of FIG. 3 may be deployed, they may be implemented as a software solution deployed on a web server. An example of such a software solution is shown in FIG. 11. As shown, server 101 may be provided for serving information (e.g., web pages) to one or more clients 104 via communication network 103. Web page access reconstructor software 1100 may be implemented as a software solution at server 101, and used for reconstructing transactions (and, in certain implementations, measuring end-to-end performance) at this particular server.

If a web site consists of multiple web servers, then as in the previous case, this software solution still can work when each web server is using “sticky connections.” In this case, the software solution 1100 can be installed at a randomly selected web server 101 in the overall site configuration, and the measured data can be considered as sampling.

As another example of how the modules of FIG. 3 may be deployed, they may be implemented as a mixed software solution with some modules deployed on a web server and some modules deployed on an independent node, outside of a web server complex. An example of such a mixed software solution is shown in FIG. 12. As shown, server 101 may be provided for serving information (e.g., web pages) to one or more clients 104 via communication network 103. A portion of the web page access reconstructor solution (e.g., certain modules) may be implemented at server 101, and the rest (e.g., the remaining modules) may be implemented at an independent node.

For example, to minimize the performance impact of additional computations on server 101, only two modules are deployed at server 101 in the example of FIG. 12, i.e., network packets collector module 301 and request-response reconstructor module 302. The outcome of request-response reconstructor module 302 is a Transaction Log 302A that is preferably two orders of magnitude smaller than the original Network Trace 301A. Such Transaction Log 302A is transferred to a different, independent node 1201 installed with web page access reconstructor module 303 and, in some implementations, performance analysis module 304. These modules process the Transaction Logs received from web server(s) 101 to reconstruct web page accesses and, in certain implementations, generate performance analysis (e.g., end-to-end performance measurements).

It should be noted that in each of the implementations described above in FIGS. 10-12, the solutions exclude from consideration the encrypted connections whose content cannot be analyzed, and hence, the HTTP level information cannot be extracted. That is, because embodiments of the present invention capture network-level information and utilize such network-level information for reconstructing web page accesses, encrypted connections are not analyzed.

When implemented via computer-executable instructions, various elements of the present invention, such as modules 301-304 of FIG. 3, are in essence the software code defining the operations of such various elements. The executable instructions or software code may be obtained from a readable medium (e.g., a hard drive media, optical media, EPROM, EEPROM, tape media, cartridge media, flash memory, ROM, memory stick, and/or the like) or communicated via a data signal from a communication medium (e.g., the Internet). In fact, readable media can include any medium that can store or transfer information.

FIG. 13 illustrates an example computer system 1300 adapted according to embodiments of the present invention. Central processing unit (CPU) 1301 is coupled to system bus 1302. CPU 1301 may be any general purpose CPU. Suitable processors include without limitation INTEL's PENTIUM® 4 processor, for example. However, the present invention is not restricted by the architecture of CPU 1301 as long as CPU 1301 supports the inventive operations as described herein. CPU 1301 may execute the various logical instructions according to embodiments of the present invention. For example, CPU 1301 may execute machine-level instructions according to the exemplary operational flows described above in conjunction with FIGS. 4, 6, 8, and 9.

Computer system 1300 also preferably includes random access memory (RAM) 1303, which may be SRAM, DRAM, SDRAM, or the like. Computer system 1300 may utilize RAM 1303 to store the Network Trace 301A, Transaction Log 302A, and/or Web Page Session Log 303A, as examples. Computer system 1300 preferably includes read-only memory (ROM) 1304 which may be PROM, EPROM, EEPROM, or the like. RAM 1303 and ROM 1304 hold user and system data and programs as is well known in the art.

Computer system 1300 also preferably includes input/output (I/O) adapter 1305, communications adapter 1311, user interface adapter 1308, and display adapter 1309. I/O adapter 1305 and/or user interface adapter 1308 may, in certain embodiments, enable a user to interact with computer system 1300 in order to input information (e.g., for specifying configurable variables, such as the configurable think time threshold described with FIG. 8).

I/O adapter 1305 preferably connects storage device(s) 1306, such as one or more of hard drive, compact disc (CD) drive, floppy disk drive, tape drive, etc., to computer system 1300. The storage devices may be utilized when RAM 1303 is insufficient for the memory requirements associated with storing data for reconstructing web page accesses. Communications adapter 1311 is preferably adapted to couple computer system 1300 to network 103. User interface adapter 1308 couples user input devices, such as keyboard 1313, pointing device 1307, and microphone 1314, and/or output devices, such as speaker(s) 1315, to computer system 1300. Display adapter 1309 is driven by CPU 1301 to control the display on display device 1310.

It shall be appreciated that the present invention is not limited to the architecture of system 1300. For example, any suitable processor-based device may be utilized, including without limitation personal computers, laptop computers, computer workstations, and multi-processor servers. Moreover, embodiments of the present invention may be implemented on application specific integrated circuits (ASICs) or very large scale integrated (VLSI) circuits. In fact, persons of ordinary skill in the art may utilize any number of suitable structures capable of executing logical operations according to the embodiments of the present invention.

While various examples are described herein for reconstructing web page accesses from captured network packets, it should be understood that the present invention is not so limited. Rather, certain embodiments of the present invention may be applied to reconstruct other types of client accesses of a server in a client-server network through captured network packets. For example, a client may access information from one or more servers in a client-server network, and embodiments of the present invention may be implemented to utilize captured network packets for such information access to reconstruct the client access in order, for example, to measure the client-perceived end-to-end performance in receiving the information from the server(s). Particularly if the information being accessed is retrieved by the client from the server through a plurality of transactions, embodiments of the present invention may be utilized to group the corresponding transactions together for their respective client accesses of the information.

1. A computer implemented method for reconstructing client web page accesses, said method comprising: capturing network-level information for client accesses of at least one web page; and using the captured network-level information to reconstruct a structure of said client accesses of said at least one web page by identifying transactions for said client accesses from the captured network-level information and relating said transactions to their corresponding client accesses, wherein said structure does not comprise payload of said at least one web page.
2. The method of claim 1 wherein said using the captured network-level information to reconstruct said structure of said client accesses of said at least one web page comprises relating transactions between a client and a server to their corresponding web page access.
3. The method of claim 1 wherein said step of capturing network-level information captures said network-level information on a server-side of a communication network used by said client to access said at least one web page.
4. The method of claim 1 wherein said step of capturing network-level information comprises: capturing said network-level information for a plurality of transactions.
5. The method of claim 4 wherein each of said plurality of transactions comprises a request from said client to a server and a response to said client from said server.
6. The method of claim 4 wherein said step of using the captured network-level information to reconstruct said structure of said client accesses of said at least one web page comprises: determining a respective web page access to which each of said plurality of transactions corresponds.
7. The method of claim 1 further comprising: compiling a log of reconstructed client web page accesses and evaluating the frequency at which a given object of a web page appears in the compiled log to identify inaccuracies in the reconstructed structure of said client web page accesses.
8. The method of claim 1, wherein the network-level information comprises network packets, and the method further comprises: extracting header information from the identified transactions without extracting the payload of the network packets; and storing the extracted header information from the identified transactions in a data structure.
9. The method of claim 8, further comprising: grouping the transactions into sessions of corresponding client accesses, wherein the grouping is based on at least part of the extracted header information.
10. The method of claim 1, wherein relating said transactions to their corresponding client accesses comprises: providing, in the structure, a mapping between an address of a client and web pages accessed by the client.
11. The method of claim 10, wherein providing the mapping between the client address and a particular one of the web pages comprises providing the mapping between the client address and (1) a markup language file associated with the particular web page and (2) objects embedded in the particular web page.
12. The method of claim 11, further comprising: collecting possible access patterns for the particular web page, wherein each access pattern includes a corresponding combination of any one or more of the markup language file and the objects embedded in the particular web page; determining statistics associated with the possible access patterns; and identifying at least one mistaken access pattern based on the statistics.
13. The method of claim 12, further comprising: determining true content of the particular web page based on the possible access patterns excluding the at least one mistaken access pattern.
14. The method of claim 1, further comprising: using the reconstructed structure of said client accesses of said at least one web page to measure an end-to-end performance in receiving the at least one web page by a client.
15. A computer implemented method for reconstructing client web page accesses, said method comprising: capturing network-level information for client accesses of at least one web page; and using the captured network-level information to reconstruct a structure of said client accesses of said at least one web page, wherein said structure does not comprise payload of said at least one web page, wherein said step of capturing network-level information comprises capturing said network-level information for a plurality of transactions, wherein said step of using the captured network-level information to reconstruct said structure of client accesses comprises at least one of the following: using content information included in said captured network-level information for a transaction that identifies the type of content of the transaction to determine a client web page access to which the transaction corresponds, and using information included in said captured network-level information for a transaction that directly identifies a web page to which the content of the transaction corresponds to determine a client web page access to which the transaction corresponds.
16. A computer implemented method for reconstructing client information accesses, said method comprising: capturing network-level information including network packets for client accesses of information from a server, wherein said capturing network-level information is performed on a server-side of a client-server network; identifying transactions from the captured network packets; extracting header information of the identified transactions by extracting HTTP header information; and according to the extracted header information, relating said identified transactions to their corresponding client accesses of information from said server by grouping the identified transactions into respective sessions corresponding to the client accesses based on at least some of the following extracted HTTP header information: request URL, request Referer field, request content type, response status code, request via header field, flow identifier, client IP address, request start timestamp, and response end timestamp, wherein each client access of said information comprises a plurality of transactions.
17. The method of claim 16 wherein capturing the network packets comprises capturing the network packets from a packet-switched communication network communicatively coupling said client and said server.
18. The method of claim 16 wherein each of said transactions comprises a request from a corresponding client to said server and a response from said server to said corresponding client.
19. The method of claim 16 wherein said step of capturing network-level information comprises compiling a transaction log of network-level information captured for each of said plurality of transactions.
20. The method of claim 16 wherein said relating step comprises: evaluating the extracted header information for each of said identified transactions to determine the corresponding client access to which the transaction corresponds.
21. The method of claim 16 wherein said information from the server comprises a web page.
22. The method of claim 16, further comprising: based on relating the identified transactions to their corresponding client accesses of information from said server, measuring an end-to-end performance in receiving information from the server by a client.
23. A system for reconstructing client web page accesses, said system comprising: server for communicating at least one web page to clients via a communication network to which said server is communicatively coupled; computer-executable software code for capturing network-level information for client accesses of said at least one web page; and computer-executable software code for extracting non-payload information from the captured network-level information and reconstructing, from the extracted non-payload information, a structure representing said client accesses of said at least one web page, wherein a client access of said at least one web page comprises a plurality of transactions, and wherein said computer-executable software code for reconstructing said structure representing said client accesses of said at least one web page further comprises computer-executable software code for relating said plurality of transactions to their corresponding client web page accesses based at least in part on said captured network-level information for said plurality of transactions, wherein said computer-executable software code for capturing and said computer-executable software code for reconstructing are stored to a computer-readable medium.
24. The system of claim 23 wherein said computer-executable software code for capturing network-level information executes on said server.
25. The system of claim 23, further comprising: computer-executable software code for measuring end-to-end performance in receiving the at least one web page from the server by a client, based on the structure representing said client accesses of said at least one web page.
26. A computer implemented method for reconstructing client web page accesses, said method comprising: capturing network-level information for a client access of at least one web page, wherein a server makes available the at least one web page to a plurality of different clients via a client-server network and wherein the capturing is performed at a point on a server side of the client-server network through which communication between the server and the plurality of different clients flows; identifying transactions of said client access from the captured network-level information, wherein the captured network-level information contains network-level information for a plurality of different client accesses of said at least one web page that are interleaved with each other; extracting header information of the identified transactions; and reconstructing, from the extracted header information, said client access of said at least one web page.
27. The method of claim 26 further comprising: reconstructing, from the extracted header information of the identified transactions, each of said plurality of different client accesses of said at least one web page.
28. The method of claim 26 wherein said reconstructing comprises: reconstructing a structure of said client access of said at least one web page, wherein said structure does not comprise payload of said at least one web page.
29. A computer implemented method for reconstructing client web page accesses, said method comprising: capturing network-level information for a client access of at least one web page, wherein a server makes available the at least one web page to a plurality of different clients via a client-server network and wherein the capturing is performed at a point on a server side of the client-server network through which communication between the server and the plurality of different clients flows; and reconstructing, from the captured network-level information, said client access of said at least one web page, wherein said reconstructing comprises: extracting non-payload information from the captured network-level information; and using the extracted non-payload information for performing said reconstructing of said client access of said at least one web page.
30. A system for reconstructing client web page accesses, said system comprising: server for communicating at least one web page to clients via a communication network to which said server is communicatively coupled; computer-executable software code for capturing network-level information for client accesses of said at least one web page, wherein the captured network-level information comprises network packets; computer-executable software code for extracting non-payload information from the captured network-level information and reconstructing, from the extracted non-payload information, a structure representing said client accesses of said at least one web page, wherein said computer-executable software code for capturing and said computer-executable software code for reconstructing are stored to a computer-readable medium; and computer-executable software code for: identifying transactions of the client accesses of said at least one web page from the captured network packets; extracting header information from the identified transactions; and based on at least a part of the extracted header information, grouping the transactions into sessions of corresponding client accesses.